Towards Open-Vocabulary Audio-Visual Event Localization
Jinxing Zhou, Dan Guo, Ruohao Guo, Yuxin Mao, Jingjing Hu, Yiran Zhong, Xiaojun Chang, Meng Wang

TL;DR
This paper introduces the open-vocabulary audio-visual event localization task, a new dataset, and baseline methods to recognize and localize both seen and unseen events in videos, advancing the field beyond closed-set limitations.
Contribution
The paper proposes the OV-AVEBench dataset and evaluation metrics, establishing a new open-vocabulary AVEL task with baseline approaches using pretrained multimodal features.
Findings
Baseline methods demonstrate the feasibility of open-vocabulary event localization.
The dataset enables evaluation of models on both seen and unseen event categories.
Pretrained multimodal features improve zero-shot and few-shot localization performance.
Abstract
The Audio-Visual Event Localization (AVEL) task aims to temporally locate and classify video events that are both audible and visible. Most research in this field assumes a closed-set setting, which restricts these models' ability to handle test data containing event categories absent (unseen) during training. Recently, a few studies have explored AVEL in an open-set setting, enabling the recognition of unseen events as "unknown", but without providing category-specific semantics. In this paper, we advance the field by introducing the Open-Vocabulary Audio-Visual Event Localization (OV-AVEL) problem, which requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference. To address this new task, we propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes (seen:unseen = 46:21), each with…
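To make the task concrete, here is a minimal sketch of per-segment open-vocabulary labeling. It assumes that audio, visual, and class-name text embeddings live in one shared space, as a pretrained multimodal encoder might provide. The function `localize_events`, the audio-visual agreement rule, the `background` label, and the toy vectors are all illustrative assumptions, not the baselines or data described in the paper.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def localize_events(audio_feats, visual_feats, text_feats, class_names,
                    background="background"):
    """Assign one label per video segment.

    audio_feats, visual_feats: arrays of shape (T, D), one row per segment.
    text_feats: array of shape (C, D), one row per candidate class name.
    Assumed rule (hypothetical): a segment gets a class label only when the
    audio and visual streams pick the same class; otherwise it is background.
    """
    a = l2_normalize(np.asarray(audio_feats, dtype=float))
    v = l2_normalize(np.asarray(visual_feats, dtype=float))
    t = l2_normalize(np.asarray(text_feats, dtype=float))

    audio_pred = (a @ t.T).argmax(axis=1)   # best class per segment from audio
    visual_pred = (v @ t.T).argmax(axis=1)  # best class per segment from video

    return [
        class_names[ap] if ap == vp else background
        for ap, vp in zip(audio_pred, visual_pred)
    ]


# Toy data: 3 segments, 2 candidate classes, 3-dimensional embeddings.
class_names = ["dog barking", "playing guitar"]
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
audio_feats = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.1, 0.9, 0.0]]
visual_feats = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.0], [0.2, 0.7, 0.0]]

labels = localize_events(audio_feats, visual_feats, text_feats, class_names)
print(labels)
```

Because the candidate classes enter only as text embeddings, unseen categories can be added at inference by appending names to `class_names` and rows to `text_feats`, with no retraining. That is the property that separates an open-vocabulary setting from a closed-set one.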
Taxonomy
Topics: Video Analysis and Summarization · Music and Audio Processing · Image Retrieval and Classification Techniques
