RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval
Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim

TL;DR
RoME is a role-aware mixture-of-experts transformer that disentangles videos and text into spatial, temporal, and object contexts and models each of them, improving text-to-video retrieval accuracy on YouCook2 and MSR-VTT without pre-training.
Contribution
The paper presents a novel transformer-based model that explicitly disentangles different contextual roles in videos and text, leveraging mixture-of-experts to capture inter-modality correlations.
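As a rough illustration of the role-wise idea, the sketch below scores a text query against videos by computing one cosine similarity per role (spatial, temporal, object) and mixing the three with softmax weights. This is a generic mixture-of-experts similarity pattern, not the paper's actual architecture: the function names, the toy embeddings, and the assumption that gating logits are supplied per query are all hypothetical.

```python
import math

# The three roles named in the summary; everything else here is illustrative.
ROLES = ("spatial", "temporal", "object")


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_similarity(text_emb, video_emb, gate_logits):
    """Weighted sum of per-role cosine similarities.

    text_emb, video_emb: dict mapping role -> embedding (hypothetical inputs).
    gate_logits: dict mapping role -> float (assumed to be given per query).
    """
    weights = softmax([gate_logits[r] for r in ROLES])
    sims = [cosine(text_emb[r], video_emb[r]) for r in ROLES]
    return sum(w * s for w, s in zip(weights, sims))


def rank_videos(text_emb, gate_logits, videos):
    """Return video ids sorted by descending mixture similarity."""
    scored = [
        (mixture_similarity(text_emb, emb, gate_logits), vid)
        for vid, emb in videos.items()
    ]
    return [vid for _, vid in sorted(scored, key=lambda x: x[0], reverse=True)]


# Toy data: one video matches the query in every role, the other in none.
query = {"spatial": [1.0, 0.0], "temporal": [0.0, 1.0], "object": [1.0, 1.0]}
gates = {"spatial": 0.0, "temporal": 0.0, "object": 0.0}  # equal weights
videos = {
    "v_match": {"spatial": [1.0, 0.0], "temporal": [0.0, 1.0], "object": [1.0, 1.0]},
    "v_other": {"spatial": [0.0, 1.0], "temporal": [1.0, 0.0], "object": [1.0, -1.0]},
}
ranking = rank_videos(query, gates, videos)
```

With equal gating logits every role contributes equally, so the video that agrees with the query in all three roles ranks first. In a trained model the embeddings and gating weights would be learned rather than fixed by hand.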
Findings
Outperforms state-of-the-art methods on the YouCook2 and MSR-VTT datasets.
Effectively models spatial, temporal, and object contexts.
No pre-training required for competitive results.
Abstract
Seas of videos are uploaded daily with the popularity of social channels; thus, retrieving the video content most related to a user's textual query plays an increasingly crucial role. Most methods consider only one joint embedding space between global visual and textual features without considering the local structures of each modality. Some other approaches consider multiple embedding spaces consisting of global and local features separately, ignoring rich inter-modality correlations. We propose a novel mixture-of-expert transformer, RoME, that disentangles the text and the video into three levels: the roles of spatial contexts, temporal contexts, and object contexts. We utilize a transformer-based attention mechanism to fully exploit visual and text embeddings at both global and local levels, with mixture-of-experts for considering correlations between modalities and structures. The results…
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
Methods: Mixture-of-Experts · Transformer attention
