Localizing Events in Videos with Multimodal Queries

Gengyuan Zhang; Mang Ling Ada Fok; Jialu Ma; Yan Xia; Daniel Cremers; Philip Torr; Volker Tresp; Jindong Gu

arXiv:2406.10079·cs.CV·November 22, 2024

Open Access

TL;DR

This paper introduces a new benchmark and methods for localizing events in videos using multimodal queries that combine images and text, aiming to improve video understanding and search applications.

Contribution

It presents ICQ, a benchmark for localizing events in videos with multimodal queries, along with three query adaptation methods and a surrogate fine-tuning strategy on pseudo-MQs, addressing current research's reliance on natural language queries alone.

Findings

01. Multimodal queries significantly enhance video event localization.

02. The benchmark evaluates 12 state-of-the-art models across diverse domains.

03. Proposed methods improve model adaptation to multimodal queries.

Abstract

Localizing events in videos based on semantic queries is a pivotal task in video understanding, with the growing significance of user-oriented applications like video search. Yet, current research predominantly relies on natural language queries (NLQs), overlooking the potential of using multimodal queries (MQs) that integrate images to more flexibly represent semantic queries -- especially when it is difficult to express non-verbal or unfamiliar concepts in words. To bridge this gap, we introduce ICQ, a new benchmark designed for localizing events in videos with MQs, alongside an evaluation dataset ICQ-Highlight. To accommodate and evaluate existing video localization models for this new task, we propose 3 Multimodal Query Adaptation methods and a novel Surrogate Fine-tuning on pseudo-MQs strategy. ICQ systematically benchmarks 12 state-of-the-art backbone models, spanning from…

Taxonomy

Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques