RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval

Burak Satar; Hongyuan Zhu; Hanwang Zhang; Joo Hwee Lim

arXiv:2206.12845 · cs.CV · June 28, 2022 · 1 citation

TL;DR

RoME introduces a role-aware mixture-of-experts transformer that disentangles and models spatial, temporal, and object contexts in videos and text, significantly improving text-to-video retrieval accuracy without pre-training.

Contribution

The paper presents a novel transformer-based model that explicitly disentangles different contextual roles in videos and text, leveraging mixture-of-experts to capture inter-modality correlations.

Findings

1. Outperforms state-of-the-art methods on the YouCook2 and MSR-VTT datasets.
2. Effectively models spatial, temporal, and object contexts.
3. Requires no pre-training to reach competitive results.

Abstract

Vast numbers of videos are uploaded daily as social channels grow in popularity, so retrieving the video content most relevant to a user's textual query plays an increasingly crucial role. Most methods consider only one joint embedding space between global visual and textual features, without considering the local structures of each modality. Other approaches consider multiple embedding spaces consisting of global and local features separately, ignoring rich inter-modality correlations. We propose RoME, a novel mixture-of-experts transformer that disentangles the text and the video into three levels: the roles of spatial contexts, temporal contexts, and object contexts. We utilize a transformer-based attention mechanism to fully exploit visual and text embeddings at both global and local levels, with mixture-of-experts to account for correlations across modalities and structures. The results…
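The role-wise matching described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each text and video has already been encoded into one vector per role, that per-role similarity is cosine similarity, and that a softmax over hypothetical gate scores mixes the three role "experts". The names ROLES, moe_similarity, rank_videos, and gate_logits are illustrative only and do not come from the paper or its repository.

```python
import math

# The three roles named in the abstract.
ROLES = ("spatial", "temporal", "object")


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_similarity(text_emb, video_emb, gate_logits):
    """Gate-weighted sum of per-role cosine similarities.

    text_emb, video_emb: dict mapping role -> vector (assumed encoder outputs).
    gate_logits: dict mapping role -> float (assumed query-dependent gate scores).
    """
    weights = softmax([gate_logits[r] for r in ROLES])
    sims = [cosine(text_emb[r], video_emb[r]) for r in ROLES]
    return sum(w * s for w, s in zip(weights, sims))


def rank_videos(text_emb, gate_logits, videos):
    """Return video ids sorted by descending mixture similarity."""
    scored = [
        (moe_similarity(text_emb, emb, gate_logits), vid)
        for vid, emb in videos.items()
    ]
    return [vid for _, vid in sorted(scored, reverse=True)]


# Toy 2-D embeddings, invented for illustration only.
query = {"spatial": [1.0, 0.0], "temporal": [0.0, 1.0], "object": [1.0, 1.0]}
gates = {"spatial": 0.0, "temporal": 0.0, "object": 0.0}  # equal gate weights
videos = {
    "v_match": {"spatial": [1.0, 0.0], "temporal": [0.0, 1.0], "object": [1.0, 1.0]},
    "v_partial": {"spatial": [1.0, 0.0], "temporal": [1.0, 0.0], "object": [1.0, 0.0]},
}
ranking = rank_videos(query, gates, videos)
print(ranking)
```

In this toy setup the video that agrees with the query on all three roles ranks above the one that agrees only on the spatial role. In a trained model the per-role vectors and gate scores would presumably come from learned encoders rather than hand-set values.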

Code & Models

Repositories

buraksatar/RoME_video_retrieval (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques

Methods: Mixture-of-Experts · Transformer