Multi-event Video-Text Retrieval

Gengyuan Zhang; Jisen Ren; Jindong Gu; Volker Tresp

arXiv:2308.11551 · cs.CV · January 23, 2026

TL;DR

This paper introduces the Multi-event Video-Text Retrieval (MeVTR) task to handle videos with multiple events and proposes a simple yet effective model, Me-Retriever, that outperforms existing models in this new setting.

Contribution

The paper defines the MeVTR task for multi-event videos and proposes the Me-Retriever model with a novel MeVTR loss, addressing a gap in current video-text retrieval methods.

Findings

01. Me-Retriever outperforms existing models on MeVTR benchmarks.

02. The proposed model effectively handles videos with multiple events.

03. The work establishes a new baseline for multi-event video-text retrieval.

Abstract

Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet. A plethora of work characterized by using a two-stream Vision-Language model architecture that learns a joint representation of video-text pairs has become a prominent approach for the VTR task. However, these models operate under the assumption of bijective video-text correspondences and neglect a more practical scenario where video content usually encompasses multiple events, while texts like user queries or webpage metadata tend to be specific and correspond to single events. This establishes a gap between the previous training objective and real-world applications, leading to the potential performance degradation of earlier models during inference. In this study, we introduce the Multi-event Video-Text Retrieval (MeVTR) task, addressing scenarios in which each video…
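The contrast the abstract draws between one-to-one video-text pairs and videos that contain several events can be illustrated with a small retrieval sketch. This is not the paper's Me-Retriever model or its evaluation protocol. The embeddings are made-up toy vectors, and `text_to_video`, `text_to_video_recall_at_1`, and `video_to_text_recall_at_1` are hypothetical names introduced here only for illustration. The sketch assumes each text describes one event and several texts may point to the same video.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def text_to_video_recall_at_1(text_embs, video_embs, text_to_video):
    """Fraction of texts whose top-ranked video is their source video."""
    hits = 0
    for t, emb in enumerate(text_embs):
        scores = [cosine(emb, v) for v in video_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == text_to_video[t])
    return hits / len(text_embs)


def video_to_text_recall_at_1(text_embs, video_embs, text_to_video):
    """Fraction of videos whose top-ranked text is ANY of their event texts.

    Assumption for this sketch: a video counts as a hit if the best-scoring
    text belongs to it, since several texts can be correct for one video.
    """
    hits = 0
    for v, emb in enumerate(video_embs):
        scores = [cosine(emb, t) for t in text_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(text_to_video[best] == v)
    return hits / len(video_embs)


# Toy data (illustrative only): 2 videos, 5 event-level texts.
video_embs = [[1.0, 0.2], [0.1, 1.0]]
text_embs = [[1.0, 0.0], [0.9, 0.4], [0.0, 1.0], [0.3, 0.9], [0.2, 1.0]]
# Many-to-one mapping: texts 0-1 describe video 0, texts 2-4 describe video 1.
text_to_video = [0, 0, 1, 1, 1]

t2v = text_to_video_recall_at_1(text_embs, video_embs, text_to_video)
v2t = video_to_text_recall_at_1(text_embs, video_embs, text_to_video)
print(f"text-to-video R@1: {t2v:.2f}, video-to-text R@1: {v2t:.2f}")
```

Under a bijective assumption, `text_to_video` would be a permutation with exactly one text per video. The many-to-one mapping is the only change needed to express the multi-event setting in this toy form.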

Code & Models

Repositories

gengyuanmax/mevtr (PyTorch, official)

Videos

Multi-Event Video-Text Retrieval · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning