Towards Open-Vocabulary Audio-Visual Event Localization

Jinxing Zhou; Dan Guo; Ruohao Guo; Yuxin Mao; Jingjing Hu; Yiran Zhong; Xiaojun Chang; Meng Wang

arXiv:2411.11278 · cs.CV · March 12, 2025 · 2 cites

TL;DR

This paper introduces the open-vocabulary audio-visual event localization task, a new dataset, and baseline methods to recognize and localize both seen and unseen events in videos, advancing the field beyond closed-set limitations.

Contribution

The paper proposes the OV-AVEBench dataset and evaluation metrics, establishing a new open-vocabulary AVEL task with baseline approaches using pretrained multimodal features.

Findings

01. Baseline methods demonstrate the feasibility of open-vocabulary event localization.

02. The dataset enables evaluation of models on both seen and unseen event categories.

03. Pretrained multimodal features improve zero-shot and few-shot localization performance.

Abstract

The Audio-Visual Event Localization (AVEL) task aims to temporally locate and classify video events that are both audible and visible. Most research in this field assumes a closed-set setting, which restricts these models' ability to handle test data containing event categories absent (unseen) during training. Recently, a few studies have explored AVEL in an open-set setting, enabling the recognition of unseen events as "unknown", but without providing category-specific semantics. In this paper, we advance the field by introducing the Open-Vocabulary Audio-Visual Event Localization (OV-AVEL) problem, which requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference. To address this new task, we propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes (seen:unseen = 46:21), each with…

Code & Models

Repositories

jasongief/ov-avel (Official)

Taxonomy

Topics: Video Analysis and Summarization · Music and Audio Processing · Image Retrieval and Classification Techniques